Learning to Crawl: Classifier-guided Topical Crawlers
Abstract
Topical or focused crawlers follow the hyperlinked structure of the Web guided by the scent of information to identify and harvest topically relevant pages. To sniff out the appropriate scent, they mine the content of pages that are already fetched to prioritize the fetching of unvisited pages. Topical crawling is currently a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been suggested sporadically in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. We also explore the effects of various techniques for deriving contexts of hyperlinks on crawling performance. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler, thus biasing it towards certain portions of the graph (i.e., the Web). We have designed and developed a crawling framework that allows for flexible addition of new classifiers. The crawlers themselves are implemented as multi-threaded objects that run concurrently. Our results show that Naive Bayes is a weak choice for guiding a topical crawler. We also find that a crawler that exploits words both in the immediate vicinity of a hyperlink and in the entire parent page performs better than a crawler that depends on just one of those cues. Also, a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We
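The best-first search described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' framework: the in-memory `TOY_WEB` graph stands in for the Web, the keyword-overlap `score` function stands in for a trained classifier, and the loop is single-threaded, whereas the paper describes a parallel, multi-threaded design. Each unvisited link inherits the score of the parent page it was found on, so links from on-topic pages are fetched first.

```python
import heapq

# Hypothetical toy "Web": url -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("crawler search web", ["a", "b"]),
    "a": ("topical crawler classifier", ["c"]),
    "b": ("cooking recipes pasta", ["d"]),
    "c": ("focused crawler evaluation", []),
    "d": ("pasta sauce", []),
}

# Hypothetical topic vocabulary; a real system would use a trained classifier.
TOPIC_WORDS = {"crawler", "classifier", "topical", "focused"}


def score(text):
    """Stand-in for a classifier: fraction of words that are on-topic."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w in TOPIC_WORDS for w in words) / len(words)


def best_first_crawl(seed, max_pages):
    """Visit up to max_pages pages, always expanding the best-scored link."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(0.0, seed)]
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]  # "fetch" the page
        visited.append(url)
        parent_score = score(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                # The unvisited link inherits its parent page's score.
                heapq.heappush(frontier, (-parent_score, link))
    return visited


if __name__ == "__main__":
    print(best_first_crawl("seed", 4))
```

In this toy graph the link found on the on-topic page `a` is fetched before the off-topic page `b`, which is the biasing effect the abstract attributes to the classifier. Scoring a link by the text near it rather than by the whole parent page would only require changing what is passed to `score`.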
Similar Resources
Expanding Reinforcement Learning Approaches for Efficient Crawling the Web
The amount of accessible information on the World Wide Web is increasing rapidly, so a general-purpose search engine cannot index everything on the Web. Focused crawlers have been proposed as a potential approach to overcome the coverage problem of search engines by limiting their domain of concentration. Focused crawling is a technique which is able to crawl particular topical portions ...
Defining Evaluation Methodologies for Topical Crawlers
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. We have argued that general evaluation methodologies a...
Combining Classifier Guided by Semi-Supervision
The article suggests an algorithm for regular classifier ensemble methodology. The proposed methodology is based on possibilistic aggregation to classify samples. The argued method optimizes an objective function that combines environment recognition, multi-criteria aggregation term and a learning term. The optimization aims at learning backgrounds as solid clusters in subspaces of the high...
Topical Crawling for Business Intelligence
The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage the basic intelligence to facilitate in-dept...